Make op_upsample_bilinear2d_aa_test deterministic #19357
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19357
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 20 Pending as of commit f2388dd with merge base b3baac5.
NEW FAILURE - The following job has failed: …
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104150928.
Pull request overview
This PR addresses test flakiness in the portable anti-aliased bilinear upsample parity tests by making their randomized inputs deterministic, and updates the iOS LLM Apple BUCK rule’s test timeout/label configuration to accommodate very long-running XCTest inference runs.
Changes:
- Seed PyTorch RNG in `UpsampleBilinear2dAATest` to make `randn`/`randint` inputs deterministic across runs (sketched after this list).
- Relax `atol` for two specific flaky parity tests to match the observed worst-case divergence under the pinned seed.
- Update Apple LLM BUCK test labeling/timeouts to allow significantly longer XCTest execution.
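For context, a minimal sketch of the seeding pattern the first bullet describes (the class name matches the test file, but the test body and tensor shapes here are illustrative, not the actual test code):

```python
import unittest

import torch


class UpsampleBilinear2dAATest(unittest.TestCase):
    def setUp(self) -> None:
        # Re-seed before every test method so torch.randn / torch.randint
        # draw the same values on every run, in any test order.
        torch.manual_seed(0)

    def test_inputs_are_deterministic(self) -> None:
        # Illustrative check: re-seeding reproduces the exact same draw.
        x = torch.randn(1, 3, 16, 16)
        torch.manual_seed(0)
        y = torch.randn(1, 3, 16, 16)
        self.assertTrue(torch.equal(x, y))
```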
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kernels/portable/test/op_upsample_bilinear2d_aa_test.py | Pins RNG for deterministic inputs and adjusts tolerances to eliminate flakes in ATen-vs-ExecuTorch parity checks. |
| extension/llm/apple/BUCK | Changes test label and increases overall test rule timeout to support long-running iOS-sim LLM inference tests. |
```python
def setUp(self) -> None:
    # Pin RNG so torch.randn / torch.randint inputs are deterministic.
    # Without this, the parity tests below occasionally see input values
    # that produce ATen-vs-ExecuTorch differences just above the
    # configured atol, surfacing as flakes on the test-issues dashboard.
    torch.manual_seed(0)
```
Summary:
Three test methods in
`fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py`
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):
- test_upsample_bilinear2d_aa_aten_parity_u8
- test_upsample_bilinear2d_aa_aggressive_downsampling
- test_upsample_bilinear2d_aa_align_corners_downsampling
Root cause: each test builds its input via `torch.randint(...)` or
`torch.randn(...)` with no seed pinned, so each run sees a different
sample. The configured `atol` was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.
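To make the failure mode concrete, here is a toy illustration (the perturbation is synthetic; the real divergence comes from the separable-vs-direct anti-aliasing difference, and 0.4 stands in for the old aggressive-downsampling `atol`):

```python
import torch

atol = 0.4  # the old aggressive-downsampling tolerance
for run in range(3):
    x = torch.randn(1, 3, 64, 64)  # no seed pinned: new sample each run
    ref = torch.nn.functional.interpolate(
        x, size=(16, 16), mode="bilinear", antialias=True
    )
    # Synthetic stand-in for the ExecuTorch output: an input-dependent
    # perturbation whose worst case hovers right around atol.
    out = ref + 0.1 * x.abs().amax() * torch.rand_like(ref)
    max_err = (out - ref).abs().max().item()
    print(f"run {run}: max abs err {max_err:.3f} -> "
          f"{'PASS' if max_err <= atol else 'FAIL'}")
```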
Fix:
1. Add `setUp(self): torch.manual_seed(0)` so every run sees the same
input tensor and the same divergence, eliminating the run-to-run
FAIL/PASS oscillation.
2. Bump two atol thresholds to cover the worst-case observed
divergence with the now-pinned input:
- u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
- aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
~0.59 for N(0,1) input)
3. The pre-existing `atol=0.25` on align_corners_downsampling is left
unchanged - with seed 0 it now passes consistently.
The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in `op_upsample_bilinear2d_aa_test.cpp` still validates the
kernel under tighter constraints.
Differential Revision: D104150928
Force-pushed from bd692d8 to bdfcb80.
@claude review this PR and the comment from Copilot, seems reasonable
Claude finished @psiddh's task in 1m 3s (View job)

Review of PR #19357

Overall: This PR looks good. The fix is straightforward and the approach (pinning the seed + adjusting tolerances) is the standard pattern for making flaky numerical-parity tests deterministic.

On Copilot's suggestion (RNG state save/restore): Copilot's suggestion to save/restore RNG state via `torch.get_rng_state()` / `torch.set_rng_state()` in `setUp`/`tearDown` … So the current approach of just `torch.manual_seed(0)` in `setUp` …

Minor observations: …

LGTM — no blocking concerns.
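The trade-off being discussed, sketched as a toy script (not code from the PR): `torch.manual_seed` mutates process-global RNG state, so anything that runs later in the same process sees a reseeded stream unless the state is restored.

```python
import torch

before = torch.rand(1)   # depends on whatever state the process had
torch.manual_seed(0)     # what the new setUp does
in_test = torch.rand(1)  # deterministic: identical on every run
after = torch.rand(1)    # also deterministic now; the seed has "leaked"
                         # into everything that runs later in-process
print(before.item(), in_test.item(), after.item())
```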
Reviewed By: rascani
Differential Revision: D104150928
Copilot's suggested alternative, which preserves the process-global RNG state:

```python
def setUp(self) -> None:
    # Save RNG state so we can restore it in tearDown; without this,
    # `torch.manual_seed` would leak determinism into other test
    # modules that share the same process.
    self._torch_rng_state = torch.get_rng_state()
```
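The matching restore would presumably live in `tearDown`; a sketch assuming the attribute name from the hunk above:

```python
def tearDown(self) -> None:
    # Restore the RNG state captured in setUp so the pinned seed does
    # not leak into tests that run later in the same process.
    torch.set_rng_state(self._torch_rng_state)
```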